ABSTRACT


Early predictors of student success are becoming a key tool 
in flipped and online courses to ensure that no student is left 
behind along course activities. However, with an increased 
interest in this area, it has become hard to keep track of what 
the state of the art in early success prediction is. Moreover, 
prior work on early success prediction based on clickstreams 
has mostly focused on implementing features and models for 
a specific online course (e.g., a MOOC). It remains there- 
fore under-explored how different features and models enable 
early predictions, based on the domain, structure, and edu- 
cational setting of a given course. In this paper, we report 
the results of a systematic analysis of early success predic- 
tors for both flipped and online courses. In the first part, we 
focus on a specific flipped course. Specifically, we investigate 
eight feature sets, presented at top-level educational venues 
over the last few years, and a novel feature set proposed in 
this paper and tailored to this setting. We benchmark the 
performance of these feature sets using a Random Forest (RF) classifier, and
we provide and discuss an ensemble feature set optimized for 
the target flipped course. In the second part, we extend our 
analysis to courses with different educational settings (i.e., 
MOOCs), domains, and structure. Our results show that 
(i) the ensemble of optimal features varies depending on the 
course setting and structure, and (ii) the predictive perfor- 
mance of the optimal ensemble feature set highly depends 
on the course activities.

1. INTRODUCTION


An increasing number of universities are now running blended 
courses that combine traditional lectures with online instruc- 
tion, providing educational models tailored to the needs of 
our society [20]. A popular instructional strategy to enable 
blended learning is represented by flipped classrooms, where
students complete pre-class activities before attending face-to-face
lessons [18]. Recent studies have shown the positive
impact and dependency of this strategy on student-centered
variables such as self-efficacy and self-regulation [22, 17, 6,
19]. Pre-class activities usually consist of watching videos
and completing quizzes that are part of Massive Open Online
Courses (MOOCs) used as supplementary material [27]. Each week,
students are asked to perform these pre-class activities and 
to then complete exercises and take part in discussions in class.
Pre-class activities are fundamental for the success of flipped
courses [12, 21, 28]. However, students often lack skills, 
time, and motivation to regulate their pre-class activity; as 
a consequence, they may experience difficulties in class and 
end up failing the course [10, 14]. To ensure that no learner
is left behind, Early Success Predictors (ESPs) are becoming
crucial to support instructors in identifying risk factors for
failing a course and acting upon them in a timely manner.


So far, few studies have analyzed student success in
flipped courses based on pre-class activities. For instance, 
Jovanovic et al. [9, 8] clustered interaction sequences in 
pre-class clickstreams to identify learning strategies, showing 
how strategy-based student profiles differ in course grades. 
Beatty et al. [2] found that frequency counts of video us- 
age are often correlated with course grades in flipped class- 
rooms. In blended, but not flipped settings, Akpinar et 
al. [1] showed that students' strategy counts, with strategies
modelled as clickstream event n-grams, are indicative
of course homework grades. Wan et al. [25, 26] trained gra- 
dient boosting classifiers on an extensive set of clickstream- 
based features to identify at-risk students in a small private 
online course delivered in hybrid mode. They also analyzed 
the importance of the features, finding that the time spent 
in online activities and the stability of time distribution dur- 
ing weeks have the highest importance in that course. To 
the best of our knowledge, no prior work on flipped courses
has specifically focused on ESPs.


Conversely, there is a large body of research on success pre- 
diction for fully online courses (e.g., MOOCs). A multitude 
of feature sets have been extracted from clickstreams for 
this purpose. Recent work proposed video-counting (e.g., 
number of videos viewed per week, rewinds, fastforwards, 
pauses, and plays, and the fractional and total amount of 
time played and paused for videos) and session-based (e.g., 
number of sessions, mean and standard deviation of the time 
for all sessions and between sessions) features [4, 13]. These 
features were fed into different commonly used classifiers
(e.g., Logistic Regression, Naive Bayes, Decision Tree, RF,
and Neural Networks) to predict success in weekly assign- 
ments or in the entire specific MOOC. In [8], several fea- 
tures that measure intra-course, intra-week and intra-day 
regularity in video watching were proposed, and their corre- 
lation with the course grade was shown. Other researchers 
leveraged attendance rates, usage rates, and watching ra- 
tios [7, 15]. Specifically, they explored how the difference 
in these indicators affects academic performance, showing 
that students whose indicators are high are more likely to 
graduate on schedule. More fine-grained features on video 
usage (e.g., total video views, mean and standard deviation 
of the proportion of videos watched, re-watched, and inter- 
rupted per week, and the frequency and total number of all 
video actions and of each type of video action) were pro- 
posed in [11]. The authors clustered students according to 
their watching behavior and found that such behavior is
representative of course performance. Similarly, Mubarak 
et al. [16] extracted implicit features from video-clickstream 
data, and investigated the extent to which neural networks
fed with those features can predict students' weekly performance.
For an extensive discussion on success prediction in
MOOCs, we refer the reader to the survey in [5].


The above features and classifiers, however, are designed 
for fully-online learning contexts, such as MOOCs. Despite 
clear connections, there are essential aspects which distin- 
guish flipped courses from MOOCs. First of all, flipped 
course data includes relatively few students. A large part of 
the learning activity happens offline and cannot be tracked, 
leading to data only on course segments. Flipped courses
also generally involve intensive instructor guidance, and
performance in them has a direct impact on the academic
portfolio. As a motivating example, we consider a flipped course
on Linear Algebra, described later in this paper, and the
regularity features proposed for MOOCs in [3]. They quantify
students' time regularity by considering their activities
over the course (e.g., studying on the same days of
the week). Boroujeni et al.'s study revealed that the final grade
in the MOOC is correlated with two intra-week regularity
measures and the periodicity of day hour and week hour
(.46 < c < ..7, p < 0.001). Conversely, the same features
showed no correlation with the final grade in the
above flipped course (.0 < c < .1, p < 0.001). Therefore,
it remains unexplored whether existing features and classi- 
fiers for MOOCs generalize to different educational settings 
(e.g., flipped classrooms), and to what extent the feature 
importance varies according to the topic, structure, and ed- 
ucational setting of the course. 


The contribution of this paper is two-fold: we tackle the 
problem of ESPs in flipped classroom settings, and we provide
an extensive analysis and benchmark of classifiers and
features for early success prediction across different types of 
courses, namely MOOCs and flipped courses. A schematic 
overview of our analysis in this paper is shown in Figure 1. 


In a first step, we propose a novel feature set for early suc- 
cess prediction in flipped courses. Our feature set mea- 
sures students’ alignment, anticipation, and strength in quiz 
and video usage. We benchmark our new feature set using
a Random Forest (RF) classifier against eight feature sets
presented in previous work on success prediction in online 
courses. We retrieved these feature sets by systematically 
scanning the recent papers published at major educational 
venues (e.g., EDM and AIED) and reproducing the features
based on the relevant papers. Our results on data
of 214 students enrolled in a linear algebra flipped course 
show that the novel feature set outperforms all previously 
suggested feature sets. We also show that predictive per- 
formance can be increased by selecting the optimal features 
from the ensemble of all feature sets. 


In a second step, we extend our analysis to further courses 
along three dimensions: domain, structure, and educational 
setting. We compute the early predictive performance again 
using an RF classifier for three additional courses: a flipped
course on functional programming (where pre-class activities 
include videos only), a MOOC on linear algebra (including 
video and quiz activities), and a MOOC on functional pro- 
gramming (including video activities only). For each course, 
we select the optimal features from the ensemble of feature 
sets (eight feature sets from prior work and one novel feature 
set from this paper) as input features for the RF classifier. 
Our results show that the structure of the course signifi- 
cantly influences performance. Predictive performance for 
courses including quizzes is much higher than for courses in- 
cluding only videos. Furthermore, we also show that while 
there is some overlap between the optimal features across 
courses, the importance of the features highly depends on 
the setting and structure of the course. 


2. EARLY PREDICTION FORMULATION 


The problem addressed in this paper can be framed into a 
time series classification task that relies on clickstreams to 
predict student success in a course. For clarity and repro- 
ducibility, we present and formalize the addressed problem. 


Course. Early success predictions are provided in the con- 
text of a course (e.g., a MOOC or a course run in a flipped 
classroom setting). In what follows, we hence mathemati- 
cally define fundamental concepts, such as the course sched- 
ule, the learning objects, and their properties. Specifically, 
we consider a set of students $U$ who are enrolled in a course
$c$, part of the online educational offering $C$. Each course
$c \in C$ has a pre-defined schedule $S_c$, consisting of $N = |S_c|$
online activities, such that $S_c = \{s_1, \ldots, s_N\}$. We assume
that each online activity $s_j$ included in the course schedule
is represented by a tuple $(o_j, t_j)$, consisting of a learning
object $o_j \in O$ and its corresponding completion deadline for
students $t_j \in \mathbb{R}^+$, modelled as a timestamp. Each
learning object $o \in O$ is characterized by descriptive properties
denoted with an $M$-dimensional vector $f_o = (f_1, \ldots, f_M)$
over a set of features $F = \{F_1, \ldots, F_M\}$ that vary according
to the type of the learning object (e.g., the duration for a
video or the maximum grade for a quiz). Specifically, each
feature $F_j \in F$ can be envisioned as a set of discrete or
continuous values describing an attribute of a learning object
$o$, with $f_{o,j} \in F_j$ for $j = 1, \ldots, M$. Our study in this paper
assumes that learning objects can be either videos or quizzes,
but the notation can be easily extended to other types (e.g.,
forum posts or readings). The type of a learning object
$o \in O$ is returned by a function $type : O \to \{video, quiz\}$.

Figure 1: Our Framework. We first analyze a flipped course with videos and quizzes, then investigate differences between courses
in flipped and MOOC settings and with videos only and videos plus quizzes. Eight state-of-the-art feature sets, a novel feature 
set, and feature ensembles are computed for each student and each week of the course. Weekly features are averaged and a 
success label is attached, according to the course type. Classification is performed using a Random Forest. Observations and 
recommendations on the predictive power of features are provided for each course setting, highlighting open challenges. 


Based on common log data collected by educational platforms,
we assume that learning objects of type video, denoted
as $O^{video} = \{o \in O \mid type(o) = video\}$, are described
by properties associated to the video duration in seconds as
$F^{video} = (duration \in \mathbb{R}^+)$. Learning objects of type quiz,
denoted as $O^{quiz} = \{o \in O \mid type(o) = quiz\}$, are characterized
by descriptive properties that model the maximum grade
students can achieve in that quiz as
$F^{quiz} = (maxgrade \in \mathbb{R}^+)$. For convenience, we use
superscripts to denote a descriptive property of a learning
object. For instance, the duration of a video $o \in O^{video}$
can be referred to as $o^{duration}$.
The same notation applies to other descriptive properties.
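As an illustration only, the course notation above could be mirrored in code roughly as follows. This is a minimal sketch, not the authors' implementation: the class names, the `object_id` field, and the example schedule are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LearningObject:
    object_id: str                    # hypothetical identifier, not part of the notation
    kind: str                         # value of type(o): "video" or "quiz"
    duration: Optional[float] = None  # o^duration in seconds (videos only)
    maxgrade: Optional[float] = None  # o^maxgrade (quizzes only)


@dataclass(frozen=True)
class Activity:
    obj: LearningObject  # learning object o_j
    deadline: float      # completion deadline t_j, a timestamp in seconds


def objects_of_type(schedule, kind):
    """Return the learning objects of the given type that appear in the schedule S_c."""
    return [a.obj for a in schedule if a.obj.kind == kind]


DAY = 24 * 3600.0
# Hypothetical two-week schedule with deadlines relative to the course start.
schedule = [
    Activity(LearningObject("v1", "video", duration=600.0), deadline=7 * DAY),
    Activity(LearningObject("q1", "quiz", maxgrade=1.0), deadline=7 * DAY),
    Activity(LearningObject("v2", "video", duration=420.0), deadline=14 * DAY),
]
videos = objects_of_type(schedule, "video")
```

Keeping the deadline on the activity rather than on the learning object follows the tuple structure of the schedule, where the same object type can carry different properties.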


Interaction. Students enrolled in an online course interact
with the learning objects included in the course schedule,
generating a time-wise clickstream. We denote a clickstream
in a course $c \in C$ for a student $u \in U$ as a time series $I_u$,
such that $I_u = \{i_1, \ldots, i_K\}$, with $K \in \mathbb{N}$ (e.g., a sequence
of video plays and pauses, quiz submissions, and so on). We leave
these definitions very general on purpose, in particular allowing
the length of each time series to differ, since our models
are inherently capable of handling this. Likewise, we neither
enforce nor expect all time series to be synchronized, i.e.,
sampled at the same time, but rather we are fully agnostic
to non-synchronized observations. This configuration
is common in educational time series. We assume that each
interaction $i_j$ is represented as a tuple $(t_j, a_j, o_j, d_j)$,
consisting of a timestamp $t_j \in \mathbb{R}^+$, the action $a_j \in A$ performed by
the student (e.g., play or pause), the learning object $o_j \in O$
involved in the action $a_j$ (e.g., a certain video or quiz),
and an $L$-dimensional descriptive vector $d_j = (d_1, \ldots, d_L)$
over a set of features $D = \{D_1, \ldots, D_L\}$. These descriptive
vectors are used to append relevant information to an
action $a_j$ performed at time $t_j$, such as the current video
time when the action occurred or the grade received by
the student on a quiz. Based on the type of the learning
object $o \in O$, the student can perform different actions
$A$. We assume that video interactions, denoted by
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid type(o_j) = video\}$, are limited to actions
$a_j \in A^{video} = \{Load, Play, Pause, Stop, SpeedChange, Seek\}$.
These actions are derived from those commonly allowed to
students in online educational platforms. Conversely, quiz
interactions, denoted by
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid type(o_j) = quiz\}$,
include actions $a_j \in A^{quiz} = \{Submit\}$.


In online educational platforms, clickstream interactions include
a payload with additional information beyond the timestamp,
the action, and the involved learning object. For instance,
if a student submits a quiz, the resulting interaction
also includes the grade assigned by the system to the student's
quiz. Our notation models each dimension $D_l \in D$
of a clickstream interaction as a set of discrete or continuous
values describing the interaction $i_j \in I_u$, with $d_{j,l} \in D_l$
for $l = 1, \ldots, L$. Specifically, we assume that interactions
involving base video actions,
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{Load, Play, Pause, Stop\}\}$,
include descriptive properties associated to the current video
time at which the interaction occurred, i.e.,
$D^{base} = (\text{current-time} \in \mathbb{R}^+)$. Interactions involving a
speed change in a video, denoted as
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{SpeedChange\}\}$,
are characterized by descriptive properties associated to both
the old and the new speed at which the video has been and
will be watched, i.e.,
$D^{speedchange} = (oldspeed \in \mathbb{R}^+, newspeed \in \mathbb{R}^+)$.
Interactions generated by students while seeking the video
backward or forward, denoted as
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{Seek\}\}$, are
modelled by descriptive properties related to the previous
and current video time the student moved to, i.e.,
$D^{seek} = (oldtime \in \mathbb{R}^+, newtime \in \mathbb{R}^+)$.
Finally, submit interactions generated in quiz activities, denoted as
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{Submit\}\}$,
include descriptive properties on the grade assigned to the quiz
answer and the progressive number of the attempt made on that
quiz, i.e., $D^{submit} = (grade \in \mathbb{R}^+, subnum \in \mathbb{R}^+)$,
with $grade \in [0, 1]$.
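To make the interaction model concrete, a possible in-memory representation with a simple consistency check is sketched below. The class layout, the payload key names, and the `is_valid` helper are assumptions made for illustration; real platform logs will use their own field names.

```python
from dataclasses import dataclass, field

VIDEO_ACTIONS = {"Load", "Play", "Pause", "Stop", "SpeedChange", "Seek"}
QUIZ_ACTIONS = {"Submit"}

# Payload keys expected for each action (hypothetical names mirroring the text).
PAYLOAD_FIELDS = {
    "Load": {"currenttime"},
    "Play": {"currenttime"},
    "Pause": {"currenttime"},
    "Stop": {"currenttime"},
    "SpeedChange": {"oldspeed", "newspeed"},
    "Seek": {"oldtime", "newtime"},
    "Submit": {"grade", "subnum"},
}


@dataclass(frozen=True)
class Interaction:
    timestamp: float  # t_j
    action: str       # a_j
    object_id: str    # identifies o_j
    object_type: str  # type(o_j): "video" or "quiz"
    payload: dict = field(default_factory=dict)  # d_j


def is_valid(i):
    """Check that the action fits the object type and carries the expected payload."""
    allowed = {"video": VIDEO_ACTIONS, "quiz": QUIZ_ACTIONS}.get(i.object_type, set())
    if i.action not in allowed:
        return False
    if set(i.payload) != PAYLOAD_FIELDS[i.action]:
        return False
    if i.action == "Submit":
        # Grades are assumed to be normalised to [0, 1].
        return 0.0 <= i.payload["grade"] <= 1.0
    return True
```

A check of this kind is useful before feature extraction, since malformed events (for instance a submission logged against a video) would otherwise silently distort the counts.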



For convenience, we denote as $I_u^{t}$ the clickstream including
the interactions $i_j \in I_u$ such that $t_j < t \;\; \forall i_j \in I_u^{t}$,
namely those that occurred before time $t$. Similarly, since online
activities in MOOCs and flipped courses are organized on a weekly
basis, $t_w$ identifies the time $t$ at which the course week $w$ ends. For
instance, the clickstream of user $u$ generated till the end of
the second week can be denoted as $I_u^{t_2}$.
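The truncation of a clickstream at the end of a course week can be sketched as follows. The sketch assumes timestamps in seconds, weeks of exactly seven days, and a hypothetical `course_start` timestamp; none of these conventions is stated in the text.

```python
WEEK = 7 * 24 * 3600  # seconds; assumes weeks of exactly seven days


def week_end(course_start, w):
    """t_w: end of course week w (1-based) for a hypothetical course start timestamp."""
    return course_start + w * WEEK


def clickstream_until(interactions, t):
    """Interactions (t_j, a_j, o_j, d_j) with t_j < t, in chronological order."""
    return sorted((i for i in interactions if i[0] < t), key=lambda i: i[0])


# Hypothetical clickstream of one student, as (timestamp, action, object, payload).
clicks = [
    (3 * 86400, "Play", "v1", {}),
    (9 * 86400, "Submit", "q1", {"grade": 1.0, "subnum": 1}),
    (16 * 86400, "Play", "v3", {}),
]
first_two_weeks = clickstream_until(clicks, week_end(0, 2))
```

Filtering on the timestamp alone keeps the helper independent of the action type, so the same function serves both video and quiz interactions.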


Success Label. Once interactions are modelled, we need to
associate a success label according to the final grade the
corresponding student has received for that course. We consider
a dataset $G$ to consist of tuples, i.e., $G = \{(I_{u_i}, y_{u_i})\}$, where
$I_{u_i}$ denotes the interactions of student $u_i$ and $y_{u_i} \in \{0, 1\}$
the pass-fail label or the above-below average grade label.


Feature Extraction. Machine-learning models rarely receive
raw interaction sequences, and so we abstract such interactions
through a feature extraction step. Given the interactions
$I_u^{t_w} \subseteq I_u \in \mathbb{I}$ generated by student $u$ till the course
week $w \in \mathbb{N}$, we produce fixed-length representations in
$\mathbb{H} \subset \mathbb{R}^{w \times H}$, where $H \in \mathbb{N}$ is the dimensionality of the
feature set. Therefore, we assume that $H$-dimensional vectors
are extracted for each week. For instance, if the feature set
includes "number of sessions" and "number of clicks", feature
vectors of size $H = 2$ are extracted each week. Formally, the
extraction process is denoted as $\mathcal{H} : \mathbb{I} \to \mathbb{H}$, from
interactions to features.
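The two-feature example mentioned above (number of sessions and number of clicks, so $H = 2$) can be sketched as a weekly extraction routine. The 30-minute inactivity gap used to split sessions and the `course_start` parameter are assumptions made for this illustration, not values taken from the paper.

```python
WEEK = 7 * 24 * 3600  # seconds; assumes weeks of exactly seven days


def count_sessions(timestamps, gap=1800):
    """Count sessions, starting a new one after more than `gap` seconds of inactivity."""
    ts = sorted(timestamps)
    if not ts:
        return 0
    return 1 + sum(1 for a, b in zip(ts, ts[1:]) if b - a > gap)


def extract_weekly_features(timestamps, course_start, n_weeks):
    """Return an n_weeks x 2 matrix with [number of sessions, number of clicks] per week."""
    rows = []
    for w in range(n_weeks):
        lo = course_start + w * WEEK
        hi = lo + WEEK
        week_ts = [t for t in timestamps if lo <= t < hi]
        rows.append([count_sessions(week_ts), len(week_ts)])
    return rows


# Three clicks in week 1 (two sessions) and one click in week 2.
features = extract_weekly_features([0, 60, 4000, WEEK + 10], course_start=0, n_weeks=2)
```

Each row is one weekly $H$-dimensional vector, so stacking the rows gives the $w \times H$ representation described in the text.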


Model. Given the dataset $G$ with interaction-success label
pairs, an early success predictor $\mathcal{E}$ aims to predict the success
label $y_{u_i}$ associated to the interactions $I_{u_i}$. Formally, this
operation can be abstracted as a function $\hat{y}_{u_i} = \mathcal{E}(I_{u_i} \mid \theta)$,
where $\hat{y}_{u_i}$ denotes the predicted label, $\theta$ denotes the model
parameters, and $\mathcal{E}$ denotes the predictive function that maps
interactions $I_{u_i}$ to the predicted label $\hat{y}_{u_i}$ according to $\theta$.


Objective Function. Hence, training an early success predictor
$\mathcal{E}$ with interaction-success label pairs till course week $w$
becomes an optimization problem, aimed at finding the model
parameters $\theta$ that maximize the expectation on the following
objective function (i.e., predicting the correct success label,
given the interactions) on a dataset $G$:

$$\tilde{\theta} = \operatorname*{argmax}_{\theta} \; \mathbb{E}_{(I_u, y_u) \in G} \left[ y_u = \mathcal{E}(\mathcal{H}(I_u^{t_w}) \mid \theta) \right] \qquad (1)$$


In this paper, we focus on feature extraction, which formally
results in the operationalization of the function $\mathcal{H}$.


3. REPRESENTATIVE FEATURE SETS 


To make sure that our work is not only based on individual 
examples of published research, we systematically scanned 
the proceedings of conferences and journals for relevant pa- 
pers in a manual process. In our analysis, we considered 
papers that appeared in recent years in the top educational
technology conferences (e.g., LAK, EC-TEL, AIED,
and EDM) and journals (e.g., IEEE TLT, Springer EIT, 
Journal of Learning Analytics). We considered a paper to 
be relevant if it (a) proposed a novel feature set for course 
success analysis, and (b) focused on the context of online 
courses or courses with online activities. Papers on other 
tasks, e.g., prediction of affective state or conceptual un- 
derstanding, or other educational contexts, e.g., interactive 
simulations or games, were not considered. Moreover, pa- 
pers with highly overlapping feature sets were filtered, and 
the paper with the most extensive set was used as represen- 
tative. Finally, eight papers were included in our study. 


In a next step, we reproduced the feature sets described 
in the above papers. Our approach was to rely as much 
as possible on the artifacts provided by the authors them- 
selves, i.e., their source code and the descriptions included 
into the papers. In theory, it should be possible to repro- 
duce published results using only the technical descriptions 
in the papers. In reality, there are many tiny implemen- 
tation details with an impact on experiments. Overall, we 
could reproduce with reasonable assumptions all eight fea- 
ture sets based on the relevant papers. In what follows, we 
give a description of each feature set included in our study. 


AkpinarEtAl. This feature set consists of consecutive sub- 
sequences of n clicks extracted from the session clickstreams 
of a blended course [1]. In addition to sub-sequences, the 


authors considered four features related to the number of 
clicks, the number of session clickstreams, and attendance 
information. Note that, in comparison to the original paper,
we extract sequences from a different set of raw events,
namely only videos and quizzes (e.g., no events on forums).
Hence, in our case the feature set has a size of $|A^{video} \cup A^{quiz}|^n$
features per student. Since we expect short patterns to be
uninterpretable and particularly long patterns to be rare,
we choose $n = 3$ for our analyses.
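The n-gram construction described for this feature set can be sketched as below. This is an illustrative reading of the description, not the original authors' code: the input format (one list of action names per session) and the ordering of the output vector are assumptions.

```python
from collections import Counter
from itertools import product

# Union of the video and quiz actions considered in this study (7 actions).
ACTIONS = ["Load", "Play", "Pause", "Stop", "SpeedChange", "Seek", "Submit"]


def ngram_counts(session_actions, n=3):
    """Count consecutive sub-sequences of n actions, never crossing session boundaries."""
    counts = Counter()
    for actions in session_actions:
        for k in range(len(actions) - n + 1):
            counts[tuple(actions[k:k + n])] += 1
    return counts


def ngram_vector(session_actions, n=3):
    """Fixed-length count vector over all len(ACTIONS) ** n possible n-grams."""
    counts = ngram_counts(session_actions, n)
    return [counts[gram] for gram in product(ACTIONS, repeat=n)]


sessions = [["Load", "Play", "Pause", "Play"], ["Play", "Pause", "Play"]]
vector = ngram_vector(sessions)
```

With seven actions and $n = 3$, the vector has $7^3 = 343$ entries, most of which are zero for any single student.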


BoroujeniEtAl. This feature set was originally used to mea- 
sure to what extent MOOC students are regular in their 
study patterns [3]. Specifically, it is considered whether stu- 
dents study on certain hours of the day, day(s) of the week or 
similar weekdays. Other features monitor whether students 
have the same distribution of study time among weekdays 
over weeks, particular amount of study time on each week- 
day, and finally to what extent a student follows the sched- 
ule of the course. This set includes 9 features per student. 
Other papers proposed similar regularity features [23, 24, 8,
9, 2]. We limit our analysis to the feature set listed in [3],
as in our first experiments it exhibited the best predictive 
power (among papers focusing on regularity features). 


ChenCui. The feature set presented in this paper [4] includes 
click counts from a mandatory undergraduate course run
through Moodle. Features include the number of total clicks 
and of clicks on campus, the ratio of on-campus to off- 
campus clicks, the number of online sessions (with average 
and standard deviation), standard deviation of time between 
online sessions, number of clicks during weekdays or week- 
ends, ratio of weekend to weekday clicks, and the number of 
clicks for each type of module (e.g., assignment, forum, and 
quiz). To accommodate the scenario presented in Section 2,
our study does not cover the features that are not easily
generalizable to different types of online courses: the number
of clicks on campus, the ratio of on-campus to off-campus clicks,
and the number of clicks for the file, forum, and report system
modules. We therefore obtain a feature set of size 13 for each student.


LalleConati. This paper [11] focuses again on MOOCs. The
presented feature set is composed of video interaction features
at two levels of granularity. Features on video views
include the total number of video views (both watches and
rewatches), in addition to the average and standard deviation
of the proportion of videos watched, re-watched, and
interrupted per week. On the other hand, features on actions
performed within the videos include the frequency and total
number of all performed video actions, the frequency of video
actions for each type of video action, and the average and
standard deviation of the duration of video pauses, seek lengths,
and so on. This feature set has a size of 22 per student.


LemayDoleck. The next paper [13] is also focused on MOOCs. 
Presented features include the number of videos watched 
per week, the average time fraction paused, played or spent 
watching, the average and standard deviation of the play- 
back rate, and the total number of rewinds, pauses, and 
fast-forwards. Note that this feature set includes only video- 
related measures, resulting in vectors of size 10 per student. 


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021)

MbouzaoEtAl. In this MOOC paper [15], the authors introduce
three novel features, namely attendance rate, utilization
rate, and watching ratio. The attendance rate of a
student in a given week is the number of videos the student
played over the total number of videos in that week.
The utilization rate is the proportion of the student's video
play time over the sum of video lengths for all
videos in that week. Finally, the watching ratio is defined
as the product of the two former features. This 3-sized
feature set has been tested in MOOCs, extending an
already-existing feature set [7].
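The three weekly indicators can be computed directly from their definitions, as in the following sketch. The input dictionaries (seconds played per video and scheduled video lengths) are hypothetical structures chosen for illustration; a video is counted as played when its play time is positive.

```python
def mbouzao_features(play_seconds, video_lengths):
    """Attendance rate, utilization rate, and watching ratio for one student and week.

    play_seconds:  video id -> seconds of play time by the student in that week
    video_lengths: video id -> duration in seconds of each video scheduled that week
    """
    played = [v for v in video_lengths if play_seconds.get(v, 0) > 0]
    attendance = len(played) / len(video_lengths)
    total_played = sum(play_seconds.get(v, 0) for v in video_lengths)
    utilization = total_played / sum(video_lengths.values())
    return attendance, utilization, attendance * utilization


# One of two scheduled videos was played, for half of its length.
attendance, utilization, watching = mbouzao_features(
    {"v1": 300}, {"v1": 600, "v2": 400}
)
```

Because the watching ratio is the product of the other two features, it penalises students who open many videos but watch little of each, as well as those who watch a single video in full.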


MubarakEtAl. This paper [16] is primarily focused on im- 
plicit features about video-usage behavior in MOOCs. Composed
of 13 features, this set covers fine-grained characteristics,
such as the percentage of the video the learner watched
not counting repeated segments, the amount of real time the 
learner spent watching the video (i.e. when playing or paus- 
ing) compared to the video duration, and the sum of times 
a learner viewed a video in its entirety. 


WanEtAl. This set was designed for a small private online 
course [26]. Features measure the online learning time, the 
strength of the learner’s engagement in forums and weekly 
assignments, and the extent to which students attempt to do
the homework early, among others. Tables 1 and 2 in the
cited paper provide further details. Given that we do not
cover forum interactions, our study does not consider forum
features. As a result, this set includes 14 features per student.


4. EARLY PREDICTORS IN FLIPPED 
COURSE SETTINGS 


In this section, we first present a novel feature set for flipped 
courses, based on alignment, anticipation, and strength in 
content usage. We then describe the experimental setup 
and results aimed to assess to what extent the feature sets 
(including ours) are predictive of student success. 


4.1 Our Feature Set 


The feature sets presented so far mainly tackle video-related 
features and/or consider only low-level features, with only a 
few of them including features related to quizzes or assign- 
ments. Considering that predicting the success of a student 
based on clickstream data only is a challenging task per se,
we believe that limiting features to those extracted from 
videos may result in inferior predictive performance. We 
therefore suggest a number of additional features assessing 
students’ knowledge and alignment with the course schedule. 


Competency Strength is defined as the average of the in- 
verse number of submissions for a quiz, weighted by the 
highest grade achieved by the student on that quiz. Given 
the inverse term, the value of this feature decreases when 
the student attempts the quiz multiple times and if the 
grade achieved by the student on the last attempt is not 
the highest-possible one. Hence, good-performing students 
may use few attempts and reach the maximum quiz grade 
fast (value close to 1). Students struggling with the material 
may attempt the quiz many times and not reach the max- 
imum grade (value close to 0). Given a student u and the 
week w of the course, this feature is computed as: 


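$$\frac{1}{|Q_u|} \sum_{q \in Q_u} \frac{1}{|Q_u^q|} \max(G_u^q) \qquad (2)$$

where:

• $Q_u = \{o_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge type(o_j) = quiz \wedge t_j < t_w\}$ are the quizzes taken by student $u$ till week $w$.

• $|Q_u^q| = |\{i_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge o_j = q \wedge t_j < t_w\}|$ is the number of attempts the student had on quiz $q$.

• $G_u^q = \{d_j^{grade} \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge o_j = q \wedge t_j < t_w\}$ is the set of grades the student got on quiz $q$.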
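Following the verbal definition of Competency Strength, a minimal sketch of the computation is given below. The input format (timestamp, quiz identifier, grade in [0, 1]) and the choice to return 0.0 for a student with no submissions are assumptions made here.

```python
from collections import defaultdict


def competency_strength(submissions, t_w):
    """Average over attempted quizzes of (best grade) / (number of attempts).

    submissions: iterable of (timestamp, quiz_id, grade) with grade in [0, 1]
    t_w:         end of the course week up to which submissions are considered
    """
    grades = defaultdict(list)
    for t, quiz, grade in submissions:
        if t < t_w:
            grades[quiz].append(grade)
    if not grades:
        return 0.0  # assumption: no attempted quiz yields the lowest value
    return sum(max(g) / len(g) for g in grades.values()) / len(grades)


subs = [
    (1, "q1", 1.0),                                                   # solved at once
    (2, "q2", 0.25), (3, "q2", 0.5), (4, "q2", 0.5), (5, "q2", 0.0),  # four attempts
    (20, "q3", 1.0),                                                  # after t_w, ignored
]
strength = competency_strength(subs, t_w=10)
```

In this example the first quiz contributes 1.0 and the second contributes 0.5 / 4 = 0.125, so the feature equals 0.5625.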


Competency Alignment is defined as the number of quizzes
on which the student received the maximum grade until week w,
divided by the total number of quizzes scheduled for the period
of consideration. Good-performing students may receive the 
maximum grade in all quizzes for the period of consideration 
(value close to 1); low-performing students may be behind 
the schedule and pass fewer quizzes than those proposed 
(value close to 0). Given a student u and the week w of the 
course, this feature is computed as: 


$$\frac{|Q_u^{pass} \cap S_c^{q(t_w)}|}{|S_c^{q(t_w)}|} \qquad (3)$$

where:

• $Q_u^{pass} = \{o_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge type(o_j) = quiz \wedge d_j^{grade} = o_j^{maxgrade}\}$ is the set of quizzes on which the student $u$ received the maximum grade until week $w$.

• $S_c^{q(t_w)} = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = quiz \wedge t_j < t_w\}$ is the set of quizzes to complete by week $w$.


Competency Anticipation is defined as the number of quizzes 
attempted by the student among those in subsequent weeks 
of the current week of study. This feature can be seen as a 
proxy of the learning propensity of a student. For instance, 
if quizzes are scheduled to be solved in subsequent weeks, we
expect that good-performing students try them earlier, anticipating
the deadline stated in the platform (value close to
1). Low-performing students may delay the consumption of 
quizzes across weeks or even towards the end of the course 
(value close to 0). Given a student u and the week w of the 
course, this feature is computed as: 


$$\frac{|Q_u \cap S_c^{q(\bar{t}_w)}|}{|S_c^{q(\bar{t}_w)}|} \qquad (4)$$

where $Q_u$ is the set of quizzes taken by student $u$ until week
$w$ as defined in Eq. 2, and:

• $S_c^{q(\bar{t}_w)} = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = quiz \wedge t_j > t_w\}$ is the set of quizzes to complete after week $w$.
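Competency Alignment and Competency Anticipation share the same structure: a set of quizzes associated with the student is intersected with the quizzes due before, or after, the end of week $w$. A minimal sketch follows, assuming quiz deadlines are given as a dictionary and that an empty denominator yields 0.0 (both are assumptions of this illustration).

```python
def competency_alignment(passed_quizzes, quiz_deadlines, t_w):
    """Share of the quizzes due before t_w on which the student got the maximum grade."""
    due = {q for q, t in quiz_deadlines.items() if t < t_w}
    return len(set(passed_quizzes) & due) / len(due) if due else 0.0


def competency_anticipation(attempted_quizzes, quiz_deadlines, t_w):
    """Share of the quizzes due after t_w that the student has already attempted."""
    upcoming = {q for q, t in quiz_deadlines.items() if t > t_w}
    return len(set(attempted_quizzes) & upcoming) / len(upcoming) if upcoming else 0.0


# Hypothetical deadlines (in days) and a student observed at t_w = 10.
deadlines = {"q1": 5, "q2": 9, "q3": 15, "q4": 20}
alignment = competency_alignment({"q1", "q3"}, deadlines, t_w=10)
anticipation = competency_anticipation({"q1", "q2", "q3"}, deadlines, t_w=10)
```

The same two functions apply unchanged to videos if the sets of passed or attempted quizzes are replaced by the set of watched videos and the deadlines by the video schedule.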


Content Alignment is defined as the number of videos watched 
by the student until week w, divided by the total number 
of videos scheduled for the period of consideration. Good- 
performing students are expected to complete all videos for 
the period of consideration (value close to 1), while low-performing
students may complete fewer videos than those
proposed (value close to 0). Given a student u and the week 
w of the course, this feature is computed as: 


$$\frac{|V_u|}{|S_c^{v(t_w)}|} \qquad (5)$$

where:

• $V_u = \{o_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge type(o_j) = video\}$ is the set of videos watched by student $u$ until week $w$.

• $S_c^{v(t_w)} = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = video \wedge t_j < t_w\}$ is the set of videos to watch by week $w$.


Content Anticipation is defined as the number of videos com- 
pleted by the student among those in subsequent weeks of 
the current week of study. For instance, if a video is due 
the next week, we expect that good-performing students 
might watch it earlier, anticipating the deadline stated
in the platform (value close to 1). On the other hand, low- 
performing students may tend to delay the completion of 
videos (value close to 0). Given a student u and the week w 
of the course, this feature is computed as: 


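$$\frac{|V_u \cap S_c^{v(\bar{t}_w)}|}{|S_c^{v(\bar{t}_w)}|} \qquad (6)$$

where $V_u$ is the set of videos watched by student $u$ until
week $w$ as defined in Eq. 5, and:

• $S_c^{v(\bar{t}_w)} = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = video \wedge t_j > t_w\}$ is the set of videos to watch after week $w$.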


Student Shape is defined as the student’s tendency of re- 
ceiving the maximum grade in a quiz at the first attempt 
in a row. Good-performing students are expected to con- 
secutively receive the maximum grade in quizzes at the first 
attempt (value close to 1); students experiencing difficulties 
may require multiple attempts on each quiz, before getting 
the maximum grade (value close to 0). Given a student u 
and the week w of the course, this feature is computed as: 


DapitsePPi @TDep HPP hs) EP OG =1l +e 
where $P = \{(p_0, l_0), \ldots, (p_n, l_n)\}$ represents a series
counting how many quizzes in a row the student passed with the
maximum grade ($l_i = 1$) or failed ($l_i = 0$) at the first
attempt. For instance, if a student gets the maximum
grade for the first five quizzes at the first attempt in a row,
then is wrong in two quizzes at the first attempt, and then
receives the maximum grade for ten quizzes at the first
attempt in a row, P would be equal to {(5, 1), (2, 0), (10, 1)}.
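
The series P is a run-length encoding of the first-attempt outcomes. A minimal Python sketch, assuming the outcomes are available as a chronological list of 0/1 flags (the function name and the input format are hypothetical), reproduces the example above:

```python
from itertools import groupby


def first_attempt_series(outcomes):
    """Run-length encode first-attempt outcomes (1 = maximum grade, 0 = failed)."""
    return [(len(list(run)), label) for label, run in groupby(outcomes)]


# The example from the text: five successes, two failures, ten successes.
outcomes = [1] * 5 + [0] * 2 + [1] * 10
P = first_attempt_series(outcomes)
```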


Student Speed is defined as the average time passed between 
two consecutive attempts for the same quiz, among those 
taken by the student. This feature captures intrinsic behav- 
ior of students who take the quiz, spending less time or more 
time to attempt it, on average. Given a student u and the 
week w of the course, this feature is computed as: 


$$\frac{1}{|Q_u|} \sum_{q \in Q_u} \frac{1}{|t_q|} \sum_{i=1}^{|t_q|} \left( t_q^{i} - t_q^{i-1} \right) \quad (8)$$

where $Q_u$ is the set of quizzes taken by student u until week
w as defined in Eq. 2, and:

• $t_q = [t_j \mid (t_j, a_j, o_j, d_j) \in I_u \wedge o_j = q \wedge t_j > t_{j-1}]$ are
  timings between trials for u on q, chronologically.
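
A minimal Python sketch of Student Speed follows, assuming attempt timestamps (in seconds) are grouped by quiz. The data layout and the function name are hypothetical, and the normalization is an assumption: gaps are averaged per quiz, quizzes with a single attempt are skipped, and the per-quiz means are then averaged. The exact normalization in Eq. 8 may differ.

```python
from statistics import mean


def student_speed(attempts_by_quiz):
    """Mean gap between consecutive attempts, averaged over the quizzes taken."""
    per_quiz = []
    for times in attempts_by_quiz.values():
        ordered = sorted(times)  # chronological order of the attempts
        gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
        if gaps:  # assumption: quizzes attempted only once carry no gap
            per_quiz.append(mean(gaps))
    return mean(per_quiz) if per_quiz else 0.0


# Hypothetical attempt timestamps in seconds, keyed by quiz id.
attempts = {"q1": [0, 60, 180], "q2": [1000, 1300], "q3": [5000]}
speed = student_speed(attempts)
```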


In the rest of the paper, we will refer to our set as Ours.


4.2 Experimental Evaluation 

In this section, we benchmark our new feature set against 
the eight feature sets presented in prior work (see Section 
3), on early success prediction in flipped courses. For con- 
venience, we will use author-based labels to identify feature 
sets throughout the paper, but we will be more interested in 
contrasting the impact of features in those papers based on 
what they implicitly measure (not based on the authors). 


4.2.1 Experimental Setup 


Protocol. For each dataset, we applied a train-test eval- 
uation, i.e. parameters were fit on the training data set 
and the performance of the models was evaluated on the 
test data set. We performed all experiments using Random 
Forest (RF) classifiers, known to achieve a good trade-off 
between prediction accuracy and interpretability. Perfor- 
mance of all models was computed using a nested student- 
stratified (i.e. dividing the folds by students) 10-fold cross 
validation. The same folds were used for all experiments, 
across feature sets. We optimized the hyper-parameters 
of RFs via Grid Search in Scikit-Learn. Specifically, we 
tuned the following hyper-parameters: number of estimators 
(25, 50, 100, 200, 300, 500), the maximum number of features 
(sqrt, None, log2), and the splitting criterion (gini, entropy). 
More extensive grids were run, but they did not show any 
substantial improvement. To be precise, we determined
the set of optimal hyper-parameters as follows: within each 
iteration, we ran an inner student-stratified 10-fold cross- 
validation on the training set in that iteration, and selected 
the combination of hyper-parameter values yielding the high- 
est accuracy on the inner cross-validation. Note that we 
trained RFs by weeks: the RF for week w of a given course 
was trained on data collected up to week w. To obtain the 
input features for RF for week w, we computed the weekly 
features for the selected feature set and averaged them. 
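
The nested protocol can be sketched as follows. This is a minimal, library-free illustration of the splitting logic and of the hyper-parameter grid listed above, not the code used in the experiments: the helper names, the round-robin stratification, and the toy labels are assumptions, and model fitting (one RF per grid point on the inner folds, keeping the combination with the highest inner accuracy) is deliberately left out.

```python
import random
from itertools import product

# Hyper-parameter grid reported in the text.
GRID = {
    "n_estimators": [25, 50, 100, 200, 300, 500],
    "max_features": ["sqrt", None, "log2"],
    "criterion": ["gini", "entropy"],
}


def grid_points(grid):
    """Enumerate every hyper-parameter combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def stratified_folds(labels, k, seed=0):
    """Assign student ids to k folds, keeping the pass/fail ratio similar per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels.values())):
        students = sorted(s for s, lab in labels.items() if lab == label)
        rng.shuffle(students)
        for i, student in enumerate(students):
            folds[i % k].append(student)  # round-robin within each class
    return folds


def nested_splits(labels, k_outer=10, k_inner=10, seed=0):
    """Yield (inner_folds, test_students) for each outer iteration."""
    outer = stratified_folds(labels, k_outer, seed)
    for i, test in enumerate(outer):
        train = {s: labels[s] for j, fold in enumerate(outer) if j != i for s in fold}
        yield stratified_folds(train, k_inner, seed + i + 1), test


# Hypothetical labels: 214 students, 88 of them (about 41%) failing (label 0).
labels = {f"s{i}": int(i >= 88) for i in range(214)}
combos = grid_points(GRID)
splits = list(nested_splits(labels))
```

Because each student contributes a single row per week-specific model, splitting by student id guarantees that no student appears in both the training and the test portion of an iteration.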


Data Set: LA-Flip. We consider a Linear Algebra course 
for undergraduate students taught in a flipped format for 
10 weeks at EPFL. Typical pre-class work included a list 
of video lectures and online quizzes from a Linear Algebra 
MOOC. The final exam grade, lying between 0 and 6, with 4 
as passing threshold, is considered as a measure for students’ 
performance. The repeating students were filtered out, given 
that their repeated exposure to the material might add a 
bias to our findings. The final dataset consists of clickstream 
data from 214 students, with 41% of them failing the course. 
The study was approved by the university’s ethics committee 
(HREC No. 058-2020/10.09.2020). 


4.2.2 Observations 

We evaluated the predictive accuracy of RF classifiers trained 
on the different feature sets extracted from LA-Flip under a 
binary classification that aims to identify passing and
failing students early, as described in Section 2. We also
trained RF classifiers only on the most important features
selected from all features (denoted as EnsembleAll)
and from all features except ours (denoted as
EnsembleButOurs). Figure 2 reports the balanced accuracy, the area
under the ROC curve (AUC), and the individual percentage 
of passing and failing students correctly identified (recall) 
for each feature set over all weeks and folds. 
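
Balanced accuracy is the unweighted mean of the per-class recalls, so it is directly tied to the Succeed and Non-Succeed Recall reported alongside it. A minimal sketch in plain Python (the function names and the toy predictions are hypothetical):

```python
def recall_per_class(y_true, y_pred):
    """Return the recall of each class as {label: recall}."""
    recalls = {}
    for label in sorted(set(y_true)):
        predictions = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls[label] = sum(p == label for p in predictions) / len(predictions)
    return recalls


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of the per-class recalls."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical fold: 1 = passing (Succeed), 0 = failing (Non-Succeed).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
recalls = recall_per_class(y_true, y_pred)
score = balanced_accuracy(y_true, y_pred)
```

In this toy fold, plain accuracy would be 0.7, while balanced accuracy is lower because half of the failing students are missed.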


The lowest-performing feature sets appear to be those monitoring
students’ regularity (orange) and attendance and utilization 
rates (blue). Hence, a first conclusion we can draw is: 


Highlight #1. Regularity and attendance/utilization features,
powerful in MOOCs, do not allow us to distinguish passing
from failing students in the considered flipped course.


Figure 2: [LA-Flip]. Effectiveness of a RF classifier trained on
separate feature sets and on ensembles. Our feature set is
essential to increase the effectiveness of the classifier,
especially in terms of Non-Succeed (failing students) Recall.


Figure 3: [LA-Flip]. AUC over weeks for the best six feature
sets (AkpinarEtAl, LalleConati, Ours, ChenCui, EnsembleButOurs,
EnsembleAll). The ensemble of all features (grey) leads to an
increase in effectiveness, with respect to considering feature
sets separately.


The feature sets mostly related to video-clicking behavior,
such as those from Lemay & Doleck, do not lead to substantial
differences from each other and all achieved a balanced
accuracy between 55% and 59% (similarly for AUC). This
finding might reveal an intrinsic limit for video features in
predicting student success from pre-class activities. Our
results also raise the question of how and why a certain type
of video features should be preferred over others.


Highlight #2. In this flipped course, there are minimal
differences in performance among video-usage features; an 
intrinsic predictive limit for video-usage features exists. 


This motivates investigation on the impact of features tar- 
geting quiz usage. In this direction, the features proposed 
by Wan et al. cover a range of raw counting and timing
measures that target quizzes. Figure 2 shows that this feature
set performs even worse than just using video features. Conversely,
by measuring more complex patterns in quiz consumption, 
our feature set led to a balanced accuracy of 67% (similarly
for AUC). To identify where our features make the
difference, we considered the percentage of passing and
failing students correctly classified, as shown in the two
bottom plots in Figure 2. While there are no substantial
differences between our feature set and the other ones in identifying
passing students (Succeed Recall), a clear improvement is
obtained in the detection of failing students (Non-Succeed
Recall), fundamental to ensure fewer students are left
behind. The impact of our features can also be appreciated
across weeks in Figure 3. Our features allowed the ensemble 
to be effective in the first weeks, while both ours and other 
features jointly led to an improvement in the second part of 
the course. Given our results and the characteristics of our 
features, we can observe that: 


Highlight #3. Extracting fine-grained features that model 
alignment, anticipation and strength of video/quiz usage 
results in higher predictive power on failing students. 


Though considering the feature sets separately allowed us 
to perform a fine-grained assessment and have an estima- 
tion of their predictive power, it remains unclear how the 
effectiveness of early predictors can be improved by training 
models with an ensemble of all features and to what extent 
the importance of the considered features varies. Hence, on 
the right side of the plots in Figure 2, we present the results 
achieved by a RF classifier only with the most important 
features selected from all features and from all features ex- 
cept ours. It can be observed that the optimal ensemble 
of features without ours results in lower performance,
compared to the optimal ensemble that also uses our features.
The optimal ensemble of all features has an AUC score
consistently higher than 0.70. To inspect what drives success
prediction, we computed the feature importance over weeks 
and folds, and reported in Figure 4 the importance of fea- 
tures (short description in Table 1) selected by RF. Looking 
at importance scores in Figure 4(a), we observe that: 


Highlight #4. The extent to which students anticipate con- 
tent consumption, the tendency of learning during week- 
ends, the proportion of watched videos, and the strength of 
their performance in quizzes, had the highest importance. 


Figure 4(b) shows that the difference in importance across 
features is more evident in the first weeks. This finding
emphasizes the fact that selecting appropriate features is 
more crucial when interested in very early predictions. 
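
The aggregation of importance scores over weeks and folds can be sketched as follows. This is an illustrative sketch only: it assumes one dictionary of importance scores per (week, fold) run, and the top-k cut-off is a hypothetical selection rule, since the text does not state how the most important features are thresholded. The scores are made up.

```python
from collections import defaultdict
from statistics import mean


def rank_features(importances, top_k):
    """Average each feature's importance over all (week, fold) runs and rank."""
    collected = defaultdict(list)
    for run in importances:  # one {feature: importance} dict per (week, fold)
        for feature, value in run.items():
            collected[feature].append(value)
    averaged = {feature: mean(values) for feature, values in collected.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical importance scores from three (week, fold) runs.
runs = [
    {"ContentAnticipation": 0.5, "StudentShape": 0.25, "RegPeakTime-M1": 0.25},
    {"ContentAnticipation": 0.25, "StudentShape": 0.5, "RegPeakTime-M1": 0.25},
    {"ContentAnticipation": 0.75, "StudentShape": 0.25, "RegPeakTime-M1": 0.0},
]
top = rank_features(runs, top_k=2)
```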


Figure 4: [LA-Flip]. Importance of the best nineteen features selected by a RF classifier from the ensemble of all feature sets: (a) average feature importance across weeks; (b) feature importance over weeks.
Four features of our set have been selected as important. Table 1 lists the Feature IDs and the short description of each feature.


Table 1: [LA-Flip]. Description of the most important nineteen features selected by a RF classifier from the ensemble of all
feature sets, shown in decreasing order of importance. Four features of our set have been selected among the top eleven.


ID | Set | Name | Short Description
f0 | Ours | CompetencyAnticipation | The extent to which the student approaches a quiz provided in subsequent weeks ahead of time.
f1 | LalleConati | WeeklyPropWatched-Avg | The proportion of videos the student watched, counting repeating segments.
f2 | ChenCui | RatioClicksWeekendDay | The ratio between clicks made during weekends and weekdays.
f3 | Ours | ContentAnticipation | The extent to which the student approaches a video provided in subsequent weeks ahead of time.
f4 | Ours | CompetencyStrength | The extent to which a student passes a quiz getting the maximum grade with a low number of trials.
f5 | BoroujeniEtAl | RegWeeklySim-M2 | The extent to which the student has a similar distribution of workload among weekdays across weeks.
f6 | LalleConati | WeeklyPropInter-Std | The standard deviation of the time the student spent while interrupting a video, across videos.
f7 | WanEtAl | NumSubmissionCor | The average number of quizzes attempted and correct.
f8 | WanEtAl | NumSubmissions-Avg | The number of submissions required to pass a quiz, on average.
f9 | LalleConati | WeeklyPropInter-Avg | The average time the student spent while interrupting a video, across videos.
f10 | Ours | StudentShape | The extent to which the student receives the maximum grade in quizzes at the first attempt in a row.
f11 | WanEtAl | NumSubmissionPerCorrect | Percentage of the correct quiz submissions with respect to the total submissions.
f12 | LalleConati | WeeklyPropReplayed-Avg | The proportion of videos the student re-watched, not counting repeating segments.
f13 | LalleConati | FrequencyEvent-VideoPlay | The frequency of the video play action in the students' online sessions.
f14 | AkpinarEtAl | QCheck-QCheck-VLoad | The amount of times the student checks twice a given quiz and then goes to load a video.
f15 | AkpinarEtAl | VPlay-VPause-VLoad | The amount of times the student plays a video, pauses, and then loads the next one.
f16 | BoroujeniEtAl | RegPeriodicity-M3 | The extent to which the daily study pattern is repeating over weeks (e.g., same days of the week).
f17 | BoroujeniEtAl | RegWeeklySim-M1 | The extent to which the student works on the same weekdays.
f18 | AkpinarEtAl | VStop-PCheck-VLoad | The amount of times the student stops a video, attempts a quiz, and then loads the next video.


5. EARLY PREDICTORS OVER COURSES


Our exploratory analysis revealed interesting patterns on
the predictive power and importance of a range of features.
However, the extent to which the patterns identified in that
flipped course also hold in courses with other structures and
educational settings remains under-explored. To this end,
we extended our analysis to a flipped course in a different
domain (Functional Programming, only video data in pre-class
activities), a MOOC in the same domain (Linear Algebra,
both videos and quizzes), and a MOOC from a different
domain (Functional Programming, only video interactions).


5.1 Experimental Setup


Protocol. In this experiment, we followed the steps described
in Section 4.2.1, with a few exceptions. Specifically, for each
data set, we considered only classifiers trained with the
optimal ensemble of all features proposed in prior work plus the
ones proposed in this paper. To obtain the input features
for the RF classifier on week w, we computed the weekly
features for all feature sets; then averaged features of the
same week, and finally averaged across weeks till week w.
For each course, we computed the most important features
from the ensemble (eight existing sets and ours) based on the
average importance of the features across folds and weeks.
The study was approved by the university’s ethics
committee (HREC No. 096-2020/09.04.2020).


Data Set: FP-Flip. We consider one stream of a Functional
Programming course taught to EPFL Master’s students in a
flipped manner for 10 weeks. The preparatory work included
a list of videos from a Functional Programming MOOC. Re- 
peating students were filtered out. Being a Master’s course 
with a failing percentage of only 5%, we considered whether 
a student’s final course grade (lying between 0 and 6) was 
above the average grade over all students as a success label. 
The dataset consists of clickstreams from 218 students, with 
38% of them being below average. 


Data Set: LA-MOOC. The content used in pre-class activi- 
ties within LA-Flip was also provided by EPFL instructors 
on an external MOOC platform in the form of three separate
MOOCs, with the first MOOC being equivalent to the first
4 weeks of the flipped course, the second MOOC equivalent
to week 5 to week 8, and the third MOOC equivalent to
the last 3 weeks. Given that the first 4 weeks of LA-Flip
were delivered in a traditional manner, we excluded the first
MOOC from our study. We also excluded the third MOOC,
given that the number of enrolled students was very small.
To sum up, our study in this paper considers only the sec- 
ond MOOC that covers the second part (weeks 5 to 8) of the 
flipped course. To pass the course, it is mostly necessary to 
obtain at least 60% of the total points for each assignment. 
Hence, we used this rule as a way to measure success in our 
study. The final data set consists of clickstream data from 
170 students, with 33% of them failing the course.


Table 2: Features selected as important by RF classifiers for the ensemble of features for each course (per-course check marks for LA-Flip, FP-Flip, LA-MOOC, and FP-MOOC omitted).


Set | Name | Short Description
AkpinarEtAl | QCheck-QCheck-QCheck | The amount of times the student checks the same quiz three times.
AkpinarEtAl | QCheck-QCheck-VLoad | The amount of times the student checks twice the quiz and then goes to load a video.
AkpinarEtAl | VPlay-VPlay-VPlay | The amount of times the student clicks for three consecutive times on play for three different videos.
AkpinarEtAl | VPlay-VPause-VLoad | The amount of times the student plays a video, pauses, and then loads another one.
AkpinarEtAl | VPlay-QCheck-QCheck | The amount of times the student plays a video, then checks twice a quiz.
AkpinarEtAl | VPlay-VStop-VPlay | The amount of times the student plays a video, stops, and plays another one.
AkpinarEtAl | VPause-VSpeedChange-VPlay | The amount of times the student pauses a video, changes the speed, and re-plays it.
AkpinarEtAl | VStop-VPlay-VSeek | The amount of times the student stops a video, then re-plays it and seeks to a given part.
AkpinarEtAl | VStop-VCheck-VLoad | The amount of times the student stops a video, checks a quiz, and then loads another video.
BoroujeniEtAl | DelayLecture | The average delay in viewing video lectures after they are released.
BoroujeniEtAl | RegWeeklySim-M1 | The extent to which the student works on the same weekdays across weeks.
BoroujeniEtAl | RegWeeklySim-M2 | The extent to which a student has a similar distribution of workload among weekdays across weeks.
BoroujeniEtAl | RegWeeklySim-M3 | The extent to which the time spent on each day of the week is similar for different weeks of the course.
BoroujeniEtAl | RegPeriodicity-M1 | The extent to which the hourly pattern of student’s activities is repeating over days.
BoroujeniEtAl | RegPeriodicity-M3 | If the daily study pattern is repeating over weeks (e.g., is active on Monday and Tuesday in every week).
BoroujeniEtAl | RegPeakTime-M1 | The extent to which students’ activities are centered around a particular hour of the day.
BoroujeniEtAl | RegPeakTime-M2 | The extent to which students’ activities are centered around a particular day of the week.
ChenCui | RatioClicksWeekendWeekdays | The ratio between clicks in weekdays and weekends.
ChenCui | TimeSession-Avg | The average amount of time spent from a login to the end of the session.
ChenCui | TimeSession-Std | The standard deviation of time spent from a login to the end of the session.
ChenCui | TimeBetweenSessions | The average amount of time passed between two sessions for a student.
ChenCui | TotalClicks-Weekdays | The number of clicks performed by a student over weekdays.
LalleConati | PauseDuration-Avg | The average amount of time spent in pause while interacting with a video.
LalleConati | SeekLength-Std | The extent to which the seek length varies across videos.
LalleConati | PauseDuration-Std | The extent to which the pause duration varies across videos.
LalleConati | TimeSpeedingUp-Avg | The average amount of time spent with higher than 1x speed while playing a video.
LalleConati | TimeSpeedingUp-Std | The extent to which the time spent speeding up higher than 1x the videos varies.
LalleConati | WeeklyPropWatched-Avg | The proportion of videos the student watched, counting repeating segments.
LalleConati | WeeklyPropInter-Avg | The average time the student spent in interrupting a video.
LalleConati | WeeklyPropInter-Std | The deviation of the time the student spent in interrupting a video.
LalleConati | WeeklyPropReplayed-Avg | The proportion of videos the student re-watched, counting repeating segments.
LalleConati | WeeklyPropReplayed-Std | The deviation of the proportion of videos the student re-watched, counting repeating segments.
LalleConati | FrequencyEvent-VideoPlay | The frequency of the play event in the students’ sessions.
MubarakEtAl | SpeedPlayBack-mean | The average speed the student used to play back a video.
WanEtAl | NumSubmissionsCor | The number of quizzes attempted and correct.
WanEtAl | NumSubmissions-Avg | The number of submissions performed for a quiz, on average.
WanEtAl | NumSubmissionPerCorrect | The percentage of the correct quiz submissions with respect to the total submissions.
WanEtAl | NumSubmissionDistinct | The total number of distinct problems attempted by the student.
Ours | CompetencyAnticipation | The extent to which the student approaches a quiz provided in subsequent weeks ahead of time.
Ours | ContentAnticipation | The extent to which the student approaches a video provided in subsequent weeks ahead of time.
Ours | CompetencyStrength | The extent to which a student passes a quiz getting the maximum grade with a low number of trials.
Ours | StudentShape | The extent to which the student receives the maximum grade in quizzes at the first attempt in a row.
Ours | StudentSpeed | The average amount of time passed between two submissions for the attempted quizzes.


Figure 5: AUC scores per week for RF classifiers trained on
feature ensembles. Flipped courses (*-Flip) last 10 weeks;
LA-MOOC (FP-MOOC) lasts 4 (6) weeks.


Data Set: FP-MOOC. The content delivered in pre-class 
activities in FP-Flip was also provided by EPFL instruc- 
tors on an external MOOC platform in form of two sepa- 
rate MOOCs, with the first MOOC being equivalent to the 
first 6 weeks of the flipped course and the second MOOC to 
the subsequent weeks. No data was available on the second 
MOOC, so we limited our study to only the first MOOC 
(week 1 to 6 of the flipped period of FP-Flip). To pass this 
MOOC, 80% of the total points for each of the five graded 
assignments are mostly needed. Hence, we used this rule to 
measure success in our study. The dataset consists of click- 
streams from 3,565 students, with 52% failing the course. 
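
The success labels used for the four data sets can be summarized in a short sketch. It is a simplified reading of the rules stated above, under explicit assumptions: thresholds are treated as inclusive, the per-assignment rule is applied strictly (the text says the thresholds are "mostly" required), and the function names and toy data are hypothetical.

```python
from statistics import mean


def exam_labels(grades, threshold=4.0):
    """LA-Flip style: success if the exam grade reaches the passing threshold."""
    return {s: int(g >= threshold) for s, g in grades.items()}


def above_average_labels(grades):
    """FP-Flip style: success if the final grade is above the class average."""
    average = mean(grades.values())
    return {s: int(g > average) for s, g in grades.items()}


def assignment_labels(scores, min_share):
    """MOOC style: success if every assignment reaches `min_share` of its points."""
    return {s: int(all(x >= min_share for x in shares)) for s, shares in scores.items()}


# Hypothetical grades on the 0-6 scale and per-assignment point shares.
grades = {"a": 5.0, "b": 3.5, "c": 4.0}
shares = {"a": [0.9, 0.85, 0.8], "b": [0.95, 0.7, 0.9]}

la_flip = exam_labels(grades)
fp_flip = above_average_labels(grades)
la_mooc = assignment_labels(shares, min_share=0.6)
fp_mooc = assignment_labels(shares, min_share=0.8)
```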


5.2 Observations 

We evaluated the predictive performance of a RF classifier 
across weeks for each course for the best ensemble feature 
set for that course. Figure 5 illustrates the predictive per- 
formance across weeks for all four courses. Considering the 
same course across different settings (flipped or MOOC), it 
can be observed that RFs trained on flipped course data 


part of the same two courses, highlighting again the high
dependency on the educational setting.


In a second part of this experiment, we analyzed the av- 
erage importance across weeks of the features selected by 
RFs across courses. Table 2 shows for each feature set and 
course, whether a given feature has been selected by the
corresponding RF classifier. It should be noted that this table
includes only features picked by at least one RF classifier across
courses. In general, we show that while there is some overlap
between the optimal features across courses, the importance
of the features highly depends on the setting and structure of
the course. The ratio of clicks between weekends and
weekdays (ChenCui - RatioClicksWeekendWeekdays) is selected
by all classifiers in all settings. Other features with a good
level of generalizability are represented by those measuring 
regularity (BoroujeniEtAl). The other features were picked 
according to the setting or the structure of the course. In 
particular, RFs trained on LA-Flip and LA-MOOC assigned 
a higher importance to features that measure behavior in 
quizzes (e.g., Ours or WanEtAl). Hence, we can conclude 
that when available, features on quizzes are frequently
selected, regardless of the setting. For courses with no quizzes,
namely FP-Flip and FP-MOOC, the predictive power of RFs
is mainly based on regularity and fine-grained video usage
(e.g., features on time spent in a video, as in LalleConati).


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021)


Highlight #5. When quizzes are included in the schedule, 
quiz-related features are frequently selected as important. 
This is stronger in flipped than MOOC settings. When only
videos are available, the predictive power mainly derives 
from regularity and fine-grained video-related features. 


For the same course in different settings, namely LA-Flip VS 
LA-MOOC and FP-Flip VS FP-MOOC, the optimal feature 
set heavily changed. In LA courses, quiz-related features 
were more important in the flipped context, while session- 
based features were more important in MOOCs (e.g., those 
from ChenCui). The latter finding holds for FP courses as 
well. Specifically, RFs trained on the MOOC version con- 
sistently selected features related to the students’ session. 
Another observation for FP is that in the flipped version, 
tri-grams (AkpinarEtAl) and fine-grained video usage
features (LalleConati) were picked; in the MOOC, regularity
and session-based features were more important. To sum 
up, according to Table 2: 


Highlight #6. Predictors in flipped settings often rely on 
features based on tri-grams and fine-grained video consump- 
tion. Conversely, predictors in MOOCs consider regularity 
and session-based features as important. Quiz-related
features are picked in both settings, when quizzes are available.


6. DISCUSSION 


In this section, we connect the main findings coming from 
the individual experiments and present the implications and 
limitations of our study in the early success prediction task. 


Course-Related Observations. A challenge, as our work shows, 
lies in the generalizability of feature predictive power across
courses. The variability of the results when repeating the 
exact same experiment with data from different courses (or 
slightly different settings) is very high. It is therefore chal- 
lenging to understand when, why, and how a feature tested 
on a given course could be re-used for other courses. 


Highlight #7. The predictive power of features does not of- 
ten generalize across courses with different structures and 
educational settings. This observation is stronger with
respect to the course structure than between flipped and
MOOC settings. 


This observation affects the scalability of early predictors. 
Because predictive power is so course-dependent, identifying
and enabling features predictive of student success for a
given course can take hours or days, given that the intellectual
and experimental work needs to be replicated course by course.


Highlight #8. The lack of feature predictive power gener- 
alizability questions the extent to which a feature can be 
scaled across courses with the same structure/setting. 


Our experiments also showed that including quizzes in pre- 
class activities leads to substantial improvements in effec- 
tiveness. Hence, success prediction is driven by complex re- 
lationships between students’ characteristics and the course 
domain, structure, and educational setting. 


Data-Related Observations. Research in the area of early 
success prediction is often conducted on data extracted from 


online activities only. Even in our case study (for LA-Flip),
we could not rely on data collected in class, missing an im- 
portant segment of learning. Moreover, clickstreams in this 
study do not cover other relevant interactions such as those 
in forums. In flipped courses, most (non-digitalized) dis- 
cussions happen in class, and the forum is mainly used by 
teachers for announcements. 


Highlight #9. Early success prediction in flipped courses 
would benefit from including data coming from offline ac- 
tivities (e.g., in class). 


Workflow-Related Observations. To establish reproducibility,
the description of the proposed features should go be-
yond plain-text only. Our formulation in this paper can be 
re-used to define features as formulas, making it easier to 
replicate them, especially when no source code is provided. 


Highlight #10. Feature descriptions can be accompanied 
by their mathematical formulation to ease reproducibility. 
When possible, sharing the code can facilitate their re-use. 


Though we validated the current features on RFs, other 
classifiers were not presented. However, RFs often provide
the best trade-off between effectiveness and interpretability 
(the latter was fundamental for our study) and our frame- 
work makes it easy to run this analysis on other classifiers. 
Given that other classifiers (e.g., Support Vector Machines) 
gave worse (or comparable) results in the preliminary exper- 
iments we ran, our results depict a valid picture of feature 
predictive power. 


7. CONCLUSIONS AND FUTURE WORK 


In this paper, we analyzed recent features for early success 
prediction in flipped and online courses. First, we inves- 
tigated the predictive power of eight existing feature sets 
and a novel feature set proposed in this paper on a flipped 
course. We benchmarked the predictive power of features 
using a RF classifier, and discussed the ensemble feature set 
optimal for that course. We then extended our analysis to 
courses with other settings (MOOCs), domains, and struc- 
tures, showing that the optimal ensemble and its predictive 
power vary. Our work calls for generalizable early predictors 
across courses with different characteristics. To promote re- 
search in this field, we also publicly release the source code 
developed during our study (see the footnote in Section 1). 


In future work, we plan to extend our analysis to other fea- 
tures (e.g., based on in-class data), and types of student 
success tasks (e.g., grade prediction). We also plan to ana- 
lyze more advanced classifiers and to devise robust classifiers 
across courses before testing them in the real world.